Auditory scene analysis and hidden Markov model recognition of speech in noise

Authors

  • Phil D. Green
  • Martin Cooke
  • M. D. Crawford
Abstract

We describe a novel paradigm for automatic speech recognition in noisy environments in which an initial stage of auditory scene analysis separates out the evidence for the speech to be recognised from the evidence for other sounds. In general, this evidence will be incomplete, since intruding sound sources will dominate some spectro-temporal regions. We generalise continuous-density hidden Markov model recognition to this 'occluded speech' case. The technique is based on estimating the probability that a Gaussian mixture density distribution for an auditory firing rate map will generate an observation such that the separated components are at their observed values and the remaining components are not greater than their values in the acoustic mixture. Experiments on isolated digit recognition in noise demonstrate the potential of the new approach to yield performance comparable to that of listeners.

1. AUDITORY SCENE ANALYSIS AS A PREPROCESSOR FOR SPEECH RECOGNITION

Auditory scene analysis (ASA) describes the process by which listeners separate out and pay selective attention to individual sound sources within the mixture which reaches their ears [1]. Recent work at Sheffield [2,3] and elsewhere [4,5,6] has achieved some success in computational modelling of ASA based on grouping principles such as common onset, periodicity and good continuation of source components. If ASA depends on these unconditional, primitive processes, they may be viewed as a natural preprocessing stage for ASR. In contrast to most schemes for robust ASR (see Grenie & Junqua [7] for a review), this suggestion has the advantage that it does not require a model of the noise. Furthermore, there need be no assumption about how many sound sources are present, and the set of active sources may change with time.

Fig. 1 presents quantitative results from previous segregation studies (Cooke & Brown [8]) in terms of two metrics: SNR and characterisation. The latter measures the percentage of the speech signal recovered from a mixture. This figure illustrates that whilst we achieve significant SNR improvements in each case, current auditory scene analysis algorithms typically recover rather less than 40% of the energy associated with a target source. This is not surprising, since we proceed on the basis of finding reasons to group components: some time-frequency regions will be masked to such an extent that purely data-driven grouping is unlikely to recruit them. We describe such data as occluded speech, although the analogy with visual occlusion should not be taken too far.

[Figure 1: characterisation (%) plotted against SNR for the separated output and the original mixture.]
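As a concrete illustration of the two metrics, the sketch below computes characterisation as the percentage of target energy recovered by a binary time-frequency mask, together with an energy-ratio SNR. This is a minimal sketch under our own assumptions: the magnitude-spectrogram representation, the function names and the exact energy-ratio definitions are illustrative, not taken from the paper.

    import numpy as np

    def characterisation(target_spec, mask):
        """Percentage of target-source energy recovered by a binary
        time-frequency mask (True = region assigned to the target by ASA)."""
        recovered = np.sum((target_spec * mask) ** 2)
        total = np.sum(target_spec ** 2)
        return 100.0 * recovered / total

    def snr_db(target_spec, intrusion_spec, mask=None):
        """Energy-ratio SNR in dB, optionally restricted to the masked regions."""
        if mask is not None:
            target_spec = target_spec * mask
            intrusion_spec = intrusion_spec * mask
        return 10.0 * np.log10(np.sum(target_spec ** 2)
                               / np.sum(intrusion_spec ** 2))

    # e.g. compare snr_db(speech, babble) for the raw mixture against
    # snr_db(speech, babble, asa_mask) and characterisation(speech, asa_mask)
    # for the separated output; Fig. 1 suggests the latter stays below 40%.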
The work presented here explores the possibility that occluded speech might contain sufficient information for recognition, and proposes a two-stage approach to robust ASR: signal separation by auditory scene analysis followed by recognition of the (incomplete) segregated data. The main problem addressed is the modification of ASR techniques to handle such data. In [9] we showed that a straightforward adaptation of Kohonen nets maintains an encouragingly robust performance in a frame-by-frame phone labelling task when an increasing proportion of the input vector is unavailable (e.g. no significant deterioration up to 90% random removal using a filterbank representation). Such nets can also be trained on partial data. In this paper we report on recognition of occluded speech by hidden Markov models (HMMs). Section 2 describes modifications to the HMM probability computation for incomplete observation vectors. Section 3 demonstrates the results of an experiment in which simulated auditory scene analysis provides data for the modified Viterbi algorithm, comparing the results on digit recognition in multispeaker babble with listeners' performance. Section 4 extends the approach by exploiting an auditory induction constraint.

2. HMM RECOGNITION OF OCCLUDED SPEECH VIA MARGINAL DISTRIBUTIONS

In ASR using continuous-density HMMs, each model state is associated with a probability distribution for the d-dimensional observation vector x, modelled as a finite mixture of M multivariate Gaussian distributions, so that the probability density function (pdf) of x when the model is in state j has the form:

    b_j(x) = \sum_{m=1}^{M} c_{jm} \, \mathcal{N}(x; \mu_{jm}, \Sigma_{jm})        (1)

where the c_{jm} are mixture weights and \mathcal{N}(x; \mu_{jm}, \Sigma_{jm}) denotes a multivariate Gaussian density with mean \mu_{jm} and covariance \Sigma_{jm}.
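To make the frame-level probability computation concrete, here is a minimal sketch of the occluded-speech score for one HMM state, assuming diagonal covariances and a per-frame binary mask from the ASA stage; the function name and signature are our own. Separated components contribute their ordinary Gaussian densities (the marginal of Section 2), while occluded components may additionally be integrated up to their value in the acoustic mixture (the bounded, auditory-induction case of Section 4) via the Gaussian CDF.

    import numpy as np
    from scipy.stats import norm
    from scipy.special import logsumexp

    def occluded_log_likelihood(x, present, weights, means, variances, bound=None):
        """Log b_j(x) for a partially observed frame under a
        diagonal-covariance Gaussian mixture (one HMM state).

        x         : length-d observation vector (auditory firing-rate frame)
        present   : boolean length-d mask; True where ASA recovered the component
        weights   : (M,) mixture weights c_jm
        means     : (M, d) component means mu_jm
        variances : (M, d) diagonal covariances
        bound     : optional length-d vector of values in the acoustic mixture;
                    if given, each occluded component is constrained to lie
                    at or below its bound
        """
        sd = np.sqrt(variances)
        miss = ~present
        log_terms = np.log(weights)
        for m in range(len(weights)):
            # Marginal: score only the separated (present) components.
            log_terms[m] += norm.logpdf(x[present], means[m, present],
                                        sd[m, present]).sum()
            if bound is not None:
                # Bounded integration: P(X_i <= bound_i) for occluded channels.
                log_terms[m] += norm.logcdf(bound[miss], means[m, miss],
                                            sd[m, miss]).sum()
        # log-sum-exp over mixture components gives log b_j(x)
        return logsumexp(log_terms)

With bound=None this reduces to the pure marginal of Section 2; substituting the score for b_j(x) in the standard Viterbi recursion would give the modified decoder evaluated in Section 3.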

Related articles

Speech enhancement based on hidden Markov model using sparse code shrinkage

This paper presents a new hidden Markov model-based (HMM-based) speech enhancement framework based on independent component analysis (ICA). We propose analytical procedures for training clean speech and noise models by the Baum re-estimation algorithm and present a maximum a posteriori (MAP) estimator based on a Laplacian-Gaussian combination (for clean speech and noise, respectively) in the HMM ...


Recognition of Occluded Speech by Hidden Markov Models

Previous work at Sheffield on computational models of auditory scene analysis has attempted to separate the acoustic evidence from simultaneous sound sources by techniques grounded in auditory grouping processes. For this work to be useful in automatic speech recognition, we need to develop recognition techniques which can cope with 'occluded' speech. The separation stage will group together co...


Automatic Sound Classification Inspired by Auditory Scene Analysis

A sound classification system for the automatic recognition of the acoustic environment in a hearing instrument is discussed. The system distinguishes the four sound classes ‘clean speech’, ‘speech in noise’, ‘noise’, and ‘music’ and is based on auditory features and hidden Markov models. The employed features describe level fluctuations, the spectral form and harmonicity. Sounds from a large d...


Sound Classification in Hearing Aids Inspired by Auditory Scene Analysis

A sound classification system for the automatic recognition of the acoustic environment in a hearing aid is discussed. The system distinguishes the four sound classes “clean speech,” “speech in noise,” “noise,” and “music.” A number of features that are inspired by auditory scene analysis are extracted from the sound signal. These features describe amplitude modulations, spectral profile, harmo...


The Representation of Speech in a Nonlinear Auditory Model: Time-Domain Analysis of Simulated Auditory-Nerve Firing Patterns

A nonlinear auditory model is appraised in terms of its ability to encode speech formant frequencies in the fine time structure of its output. It is demonstrated that groups of model auditory nerve (AN) fibres with similar interpeak intervals accurately encode the resonances of synthetic three-formant syllables, in close agreement with physiological data. Acoustic features are derived from the ...


Comparison of Auditory Models for Robust Speech Recognition

Two auditory front ends which emulate some aspects of the human auditory system were compared using a high performance isolated word Hidden Markov Model (HMM) speech recognizer. In these initial studies, auditory models from Seneff [2] and Ghitza [4] were compared using both clean speech and speech corrupted by speech-like "babble" noise. Preliminary results indicate that the auditory models re...




Publication date: 1995